16 research outputs found

    Techniques for clustering gene expression data

    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications which recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.

    Computational analysis of gene expression data

    Gene expression is central to the function of living cells. While advances in sequencing and expression measurement technology over the past decade have greatly facilitated the further understanding of the genome and its functions, the characterisation of functional groups of genes remains one of the most important problems in modern biology. Technological advancements have resulted in massive information output, with the priority objective shifting to the development of data analysis methods. As such, a large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments and, consequently, there is confusion regarding the best approach to take. Common techniques are not necessarily the most applicable for the analysis of patterns in microarray data. This confusion is clarified through provision of a framework for the analysis of clustering techniques and investigation of how well they apply to gene expression data. To this end, the properties of microarray data are examined, followed by an examination of the properties of clustering techniques and how well they apply to gene expression. Clearly, each technique will find patterns even if the structures are not meaningful in a biological context, and these structures are not usually the same for different algorithms. Also, these algorithms are inherently biased, as the properties of clusters reflect built-in clustering criteria. From these considerations, it is clear that cluster validation is critical for algorithm development and verification of results, which is usually based on a manual, lengthy and subjective exploration process. Consequently, it is key to the interpretation of gene expression data. We carry out a critical analysis of current methods used to evaluate clustering results. Clusters obtained from real and synthetic datasets are compared between algorithms.
To understand the properties of complex gene expression datasets, graphical representations can be used. Intuitively, the data can be represented in terms of a bipartite graph, with weighted edges between gene-sample node couples corresponding to significant expression measurements of interest. In this research, this method of representation is extensively studied and used, in combination with probabilistic models, to develop new clustering techniques for the analysis of gene expression data. Performance of these techniques can be influenced both by the search algorithm and by the graph weighting scheme, and both merit vigorous investigation. A novel edge-weighting scheme, based on empirical evidence, is presented. The scheme is tested using several benchmark datasets at various levels of granularity, and comparisons are provided with a popular data analysis method currently used in the bioinformatics community. The analysis shows that the new empirically based scheme outperforms current edge-weighting methods by accounting for the subtleties in the data through a data-dependent threshold analysis, and by selecting ‘interesting’ gene-sample couples based on relative values. The graphical theme of gene expression analysis is further developed by construction of a one-mode gene expression network which specifically focuses on local interactions among genes. Classical network theory is used to identify and examine organisational properties in the resulting graphs. A new algorithm, GraphCreate, is presented which finds functional modules in the one-mode graph, i.e. sets of genes which are coherently expressed over subsets of samples, and a scoring scheme is developed (using bipartite graph properties as a basis) to weight these modules. This representation is used to extensively study published gene expression datasets and to identify functional modules of genes with GraphCreate.
This work is important as it advances research in the area of transcriptome analysis, beyond simply finding groups of coherently expressed genes, by developing a general framework to understand how and when gene sets are interacting.
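
As a rough illustration of the bipartite gene-sample representation described in this abstract, the sketch below builds weighted edges from a small expression table. The abstract does not specify the edge-weighting scheme, so the per-gene z-score cutoff used here is purely an assumption made for illustration, and the names `bipartite_edges` and `z_cutoff` are hypothetical. It is not the thesis's scheme and not GraphCreate.

```python
from statistics import mean, pstdev


def bipartite_edges(expression, z_cutoff=1.0):
    """Return {(gene, sample): weight} for 'interesting' gene-sample couples.

    ASSUMPTION: a couple is kept when the gene's expression in that sample
    deviates from the gene's own mean by at least `z_cutoff` population
    standard deviations. The z-score is used as the edge weight.
    """
    edges = {}
    for gene, by_sample in expression.items():
        values = list(by_sample.values())
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            # A gene with constant expression has no relative signal.
            continue
        for sample, value in by_sample.items():
            z = (value - mu) / sigma
            if abs(z) >= z_cutoff:
                edges[(gene, sample)] = z
    return edges


# Toy data (made up): g1 stands out in sample s3, g2 is flat.
toy = {
    "g1": {"s1": 1.0, "s2": 1.0, "s3": 4.0},
    "g2": {"s1": 2.0, "s2": 2.0, "s3": 2.0},
}
edges = bipartite_edges(toy)
```

Because the cutoff is relative to each gene's own distribution, the threshold is data-dependent in a loose sense, which is the only property this sketch shares with the scheme the abstract describes.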

    RNA-seq vs dual- and single-channel microarray data: sensitivity analysis for differential expression and clustering

    With the fast development of high-throughput sequencing technologies, a new generation of genome-wide gene expression measurements is under way. This is based on mRNA sequencing (RNA-seq), which complements the already mature technology of microarrays, and is expected to overcome some of the latter’s disadvantages. These RNA-seq data pose new challenges, however, as strengths and weaknesses have yet to be fully identified. Ideally, Next (or Second) Generation Sequencing measures can be integrated for more comprehensive gene expression investigation to facilitate analysis of whole regulatory networks. At present, however, the nature of these data is not very well understood. In this paper we study three alternative gene expression time series datasets for Drosophila melanogaster embryo development, in order to compare three measurement techniques: RNA-seq, single-channel and dual-channel microarrays. The aim is to study the state of the art for the three technologies, with a view to assessing overlapping features, data compatibility and integration potential, in the context of time series measurements. This involves using established tools for each of the three different technologies, and technical and biological replicates (for RNA-seq and microarrays, respectively), due to the limited availability of biological RNA-seq replicates for time series data. The approach consists of a sensitivity analysis for differential expression and clustering. In general, the RNA-seq dataset displayed the highest sensitivity to differential expression. The single-channel data performed similarly for the differentially expressed genes common to the gene sets considered. Cluster analysis was used to identify different features of the gene space for the three datasets, with higher similarities found for the RNA-seq and single-channel microarray datasets.

    MF BHI and cluster size for biclusters obtained in 10 runs.


    Histogram showing the distribution of average count values (from the NGS dataset) for genes commonly DE in the NGS and SC datasets (6075 genes), versus those DE only in one dataset (2805 for NGS and 356 for SC).

    Only genes probed on both platforms were considered for this analysis. Uncommon genes display lower counts compared to common. The NGS dataset also identifies a few genes with very large counts.
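
The gene counts quoted in this caption (6075 common, 2805 NGS-only, 356 SC-only) can be summarised as a single overlap figure. The sketch below computes a Jaccard index from those counts; the choice of Jaccard as the summary, and the helper name `jaccard_from_counts`, are assumptions made here for illustration and are not something the source reports.

```python
def jaccard_from_counts(common, only_a, only_b):
    """Jaccard index |A ∩ B| / |A ∪ B| from the three disjoint counts."""
    union = common + only_a + only_b
    if union == 0:
        raise ValueError("both sets are empty")
    return common / union


# Counts taken from the caption: DE genes shared by NGS and SC,
# DE only in NGS, and DE only in SC.
overlap = jaccard_from_counts(common=6075, only_a=2805, only_b=356)
print(f"Jaccard overlap of NGS and SC DE sets: {overlap:.3f}")
```

A value close to 1 would mean the two platforms flag nearly the same genes; the asymmetry between the two platform-specific counts is not visible in this single number and has to be read from the counts themselves.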

    Differentially expressed genes for .

    The NGS (Next/Second Generation Sequencing) and SC (Single-Channel) datasets display the largest commonality, while the DC (Dual-Channel) and SC the smallest.

    Cluster size and BHI values for different .

    The colour intensity of the spots indicates the number of points falling in the specific area. The graphs show that the gene space is similar for the three datasets. For small , clusters do not have a large BHI, which changes with increasing , as more clusters become relevantly differentiated.

    Percentage of reference genes represented in the DE sets obtained from the three datasets.

    The NGS dataset identifies the largest number of reference genes, and the DC dataset the lowest.

    Cluster comparison for dataset pairs.

    The Adjusted Rand Index (ARI) is displayed for each dataset pair for all combinations of (top to bottom: DC vs SC, NGS vs SC, DC vs NGS). The clusters obtained from SC and NGS are more similar than when comparing the DC dataset with the other two.
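
For readers unfamiliar with the Adjusted Rand Index mentioned in this caption, the following is a minimal sketch of the standard pair-counting definition, written with the Python standard library only. The function name `adjusted_rand_index` and the toy label vectors are illustrative assumptions; the source does not say which implementation was used to produce the figure.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: items sharing a cluster in both, in A, and in B.
    cells = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (sum_cells - expected) / (maximum - expected)


# Toy labels (made up): same grouping under different cluster names.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Toy labels (made up): every co-clustered pair in A is split in B.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions regardless of how clusters are named, is close to 0 for agreement at chance level, and can be negative when two clusterings agree less than chance would predict.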

    Bicluster average additive variance distribution over ten runs.
